fix(viz): improve dtype inference logic#12933
Conversation
20a4423 to
944339a
Compare
tests/utils_tests.py
Outdated
There was a problem hiding this comment.
Here we're making sure that all main datatypes are correctly identified both with and without null values.
Codecov Report
@@ Coverage Diff @@
## master #12933 +/- ##
==========================================
- Coverage 69.14% 66.90% -2.25%
==========================================
Files 1025 489 -536
Lines 48767 28674 -20093
Branches 5188 0 -5188
==========================================
- Hits 33718 19183 -14535
+ Misses 14915 9491 -5424
+ Partials 134 0 -134
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
I think we can first do a mapping based on types returned by |
There was a problem hiding this comment.
PANDAS_DTYPE_MAP = {
"date": GenericDataType.TEMPORAL,
"time": GenericDataType.TEMPORAL,
"datetime": GenericDataType.TEMPORAL,
"datetime64": GenericDataType.TEMPORAL,
"integer": GenericDataType.NUMERIC,
"floating": GenericDataType.NUMERIC,
"decimal": GenericDataType.NUMERIC,
"mixed-integer-float": GenericDataType.NUMERIC,
"boolean": GenericDataType.BOOLEAN,
}
def extract_dataframe_dtypes(df: pd.DataFrame) -> List[GenericDataType]:
"""Serialize pandas/numpy dtypes to generic types"""
return [
PANDAS_DTYPE_MAP.get(infer_dtype(df[col], skipna=True), GenericDataType.STRING)
for col in df.columns
]Maybe a manual inference of more complex type is not even needed since infer_dtype can already take care of NA's for us.
|
this one is in mysql ^ |
Good idea, I'll update to use this logic |
4959d52 to
b5f54f9
Compare
@junlincc the wacky formats in the table chart have now been fixed as of |
ktmud
left a comment
There was a problem hiding this comment.
LGTM. Agreed the fix for data panel should be in a separate PR.


SUMMARY
Currently query result formats are checked based on the
dtypeproperty of theDataFramecoming back from the database. This works for some types, but notintcolumn to becomeobject)object)datetime64[ns, UTC]on e.g. BigQuery when thedtypewas cast to string, causing it to be returned as a non-temporal type)This PR replaces the current inference logic with the
infer_dtypeutility offered by Pandas and adds tests for typical column types. This is really just a quick fix to make table chart column types work properly, as we're currently inferring column datatypes in three (!) different places in the codebase. A bigger refactor uniting these into one place needs to be done ASAP, but is left to a later SIP/PR. This also bumpssuperset-uito the latest version which includes a fix for the default temporal format: apache-superset/superset-ui#937AFTER
This screenshot shows

TIMESTAMP,DATETIMEandDATEcolumns on a BigQuery dataset on table chart with this fixBEFORE
The same chart prior to the PR

TEST PLAN
Local testing on various databases + new tests
ADDITIONAL INFORMATION